{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了让Python能够高效率处理表格数据，我们使用一个非常优秀的数据处理框架Pandas。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们把loans.csv里面的内容全部读取出来，存入到一个叫做df的变量里面。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# input your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们看看df前几行，以确认数据读取无误。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "统计一下总行数，看是不是读取完整。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数量正确，证明数据都读取进来了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还记得吧，每一条数据的最后一列`safe_loans`是个标记，告诉我们之前发放的这笔贷款是否安全。我们把这种标记叫做目标(target)，把前面的所有列叫做“特征”(features)。这些术语你现在记不住没关系，因为以后会反复遇到。自然强化记忆。\n",
    "\n",
    "下面我们就分别把特征和目标提取出来。依照机器学习领域的习惯，我们把特征叫做X，目标叫做y。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = df.drop('safe_loans', axis=1)\n",
    "# input your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们看一下X的形状："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "除了最后一列，其他行列都在。符合我们的预期。我们再看看“目标”列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里的逗号后面没有数字，指的是只有1列。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们来看看X的前几列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意这里有一个问题。Python下面做决策树的时候，每一个特征都应该是数值类型的。但是我们一眼就可以看出，grade, `sub_grade`, `home_ownership`等列的取值都是类别(categorical)型。所以，必须经过一步转换，把这些类别都映射成为某个数值，才能进行下面的步骤。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "from collections import defaultdict\n",
    "d = defaultdict(LabelEncoder)\n",
    "X_trans = X.apply(lambda x: d[x.name].fit_transform(x))\n",
    "X_trans.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里，我们使用了LabelEncoder函数，成功地把类别变成了数值。小测验：在grade列下面，B被映射成了什么数字？\n",
    "\n",
    "答案是1。你答对了吗？\n",
    "\n",
    "下面我们需要做的事情，是把数据分成两部分，分别叫做训练集和测试集。\n",
    "\n",
    "为什么这么折腾？\n",
    "\n",
    "因为有道理。\n",
    "\n",
    "想想看，如果期末考试之前，老师给你一套试题和答案，你把它背了下来。然后考试的时候，从试题里面抽取一部分考。你凭借超人的记忆力获得了100分。请问你学会了这门课的知识了吗？如果给你新的题目，你会不会做呢？答案是不知道。\n",
    "\n",
    "所以考试题目需要和复习题目有区别。同样的道理，我们用数据生成了决策树，这棵决策树肯定对见过的数据处理得很完美。可是它能否推广到新的数据上呢？这才是我们真正关心的。就如同在本例中，你的公司关心的，不是以前的贷款该不该贷。而是如何处理今后遇到的新贷款申请。\n",
    "\n",
    "把数据随机拆分，在Python里只需要2条语句就够了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import train_test_split\n",
    "# input your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们看看训练数据集的形状："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "测试集呢？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "至此，一切数据准备工作都已就绪。我们开始呼唤Python中的scikit-learn软件包。决策树的模型，已经集成在内，直接调用就可以，非常方便。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import tree\n",
    "clf = tree.DecisionTreeClassifier(max_depth=3)\n",
    "# input your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好了，你要的决策树已经生成完了。\n",
    "\n",
    "对，就是这么简单。任性吧？\n",
    "\n",
    "可是，你怎么知道决策树是个什么样子呢？\n",
    "\n",
    "这个……好吧，咱们把这棵决策树画出来吧。注意这一段语句内容较多。以后有机会咱们再详细介绍。此处你把它直接抄进去执行就可以了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with open(\"safe-loans.dot\", 'w') as f:\n",
    "     f = tree.export_graphviz(clf,\n",
    "                              out_file=f,\n",
    "                              max_depth = 3,\n",
    "                              impurity = True,\n",
    "                              feature_names = list(X_train),\n",
    "                              class_names = ['not safe', 'safe'],\n",
    "                              rounded = True,\n",
    "                              filled= True )\n",
    "\n",
    "from subprocess import check_call\n",
    "check_call(['dot','-Tpng','safe-loans.dot','-o','safe-loans.png'])\n",
    "\n",
    "from IPython.display import Image as PImage\n",
    "from PIL import Image, ImageDraw, ImageFont\n",
    "img = Image.open(\"safe-loans.png\")\n",
    "# input your code here:\n",
    "\n",
    "img.save('output.png')\n",
    "PImage(\"output.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你是不是跟我第一次看到决策树的可视化结果一样，惊诧了？\n",
    "\n",
    "我们只让Python生成了一棵简单的决策树（深度只有3层而已），但是Python已经尽职尽责地帮我们考虑到了各种变量对最终决策结果的影响。\n",
    "\n",
    "欣喜若狂的你，在悄悄背诵什么？你说想把这棵决策树的判断条件背下来，然后去做贷款风险判断？\n",
    "\n",
    "省省吧。都什么时代了，还这么喜欢背诵？\n",
    "\n",
    "以后的决策，电脑可以自动化帮你完成了。你不信？\n",
    "\n",
    "我们随便从测试集里面找一条数据出来。让电脑用决策树帮我们判断一下看看。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "test_rec = X_test.iloc[1,:]\n",
    "# input your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "电脑告诉我们，这笔贷款是安全的。实际情况如何呢？我们来验证一下。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_test.iloc[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "经验证，判断无误。\n",
    "\n",
    "但是我们不能用孤证来说明问题。下面我们验证一下，根据我们训练得来的决策树模型，贷款风险类别判断准确率究竟有多高。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "# input your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你可能会有些失望——忙活了半天，怎么才60%多的准确率？刚及格而已嘛。\n",
    "\n",
    "不要灰心。因为在整个儿的机器学习过程中，你用的都是缺省值，根本就没有来得及做一个重要的工作——优化。想想看，你买一台新手机，自己还得设置半天，不是吗？面对公司主营业务（贷款），你用的竟然只是没有优化的缺省模型，应该吗？可即便这样，准确率也已经超过了及格线。\n",
    "\n",
    "关于优化的问题，以后有机会咱们详细展开来聊。\n",
    "\n",
    "掌握了用Python做决策树的方法，你是否有了冲动，希望赶紧去找个数据集来实际动手实践呢？\n",
    "\n",
    "有啊。那还不赶紧去？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
